SUPERCOMPUTING TUTORIAL Part 1.

The following is a basic overview of Supercomputing and some of
the key issues within this market segment. It was written for the
PC-LIB BBS Supercomputer SIG, (612) 435-2554, located in
Burnsville, Minnesota, by ZIDEK Inc. The Supercomputer SIG is
devoted to all aspects of Supercomputing, Computer Architecture,
Parallel Processing, and Scientific and Engineering Software.

ZIDEK Inc. is an international Supercomputer Systems Services and
Consulting firm located at:

     13195 Flamingo Court
     Apple Valley, Minnesota 55124
     Telephone: (612) 432-2835
     FAX: (612) 432-4492
     Telex: 910-576-0061

This text may be freely distributed within the public domain for
educational purposes provided that credit and acknowledgement be
given to the authors at ZIDEK Inc.

Copyright ZIDEK Inc. 1987
OVERVIEW
There are currently five manufacturers of supercomputers with
products in the marketplace: Cray (X-MP and Cray-2), CDC/ETA
(CYBER 205 and ETA-10), Fujitsu (VP), Hitachi (S810), and NEC
(SX). All of these machines are vector processors; that is, a
single computer instruction may be used to call in a large number
of operands, which then flow pairwise through one or more
arithmetic units or pipelines where the specified operation is
executed in a segmented, parallel, or overlapped fashion.
Usually, the longer the operand array (or vector), the greater
the effectiveness per operation.
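To make this concrete, consider a loop of the classic SAXPY form
(our illustration; it does not appear in the original text). On a
vector processor, a few vector instructions load X(1:N) and
Y(1:N), stream the operands pairwise through the multiply and add
pipelines, and store the result:

C     SAXPY: Y = A*X + Y.  One vector load per operand array, one
C     pipelined multiply-add stream, one vector store.  The longer
C     the vector, the better the pipeline startup cost is
C     amortized.
      SUBROUTINE SAXPY(N, A, X, Y)
      INTEGER N, I
      REAL A, X(N), Y(N)
      DO 10 I = 1, N
         Y(I) = A*X(I) + Y(I)
   10 CONTINUE
      RETURN
      END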
However, in many cases users are faced with numerical problems
that cannot easily be organized in a vectorized form, or, if they
can, the vector length is very short. If this is true for any
significant part of the computation, the overall performance
slows to nearly the speed of a single processor, the so-called
scalar speed of the machine. This phenomenon is often referred to
as Amdahl's law. It is for this reason that some computer users
and designers stress the importance of fast scalar speed, frown
on vector technology, and promote parallel processing instead.
With parallel processing, the problem is decomposed into a number
of (possibly interacting) subproblems, and these are spread among
a plurality of closely coupled processors. In this case, the
decomposition need not be done on a functional basis. But if the
problem cannot be broken into approximately equal independent
segments, the maximum possible performance may not be sustainable
in most designs. It is also generally true that whenever the
problem is such as to permit vector processing, it can in most
cases be reformulated as a parallel problem. There are also cases
where the problem cannot be vectorized but is nevertheless highly
parallel. What is needed is a computer capable of easily
exploiting all the various forms of parallelism (including
parallelism of the vector type) and which has the fastest
possible scalar speed. Recognizing this, current supercomputer
vendors are in a race to design, develop, and introduce systems
in the next decade that will embody these desirable attributes.
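In its usual formulation, Amdahl's law gives the overall speedup
S attainable when a fraction f of a computation can be run in
vector or parallel mode with speedup N, while the remaining
fraction 1 - f runs at scalar speed:

     S = 1 / ((1 - f) + f/N)

No matter how large N becomes, S can never exceed 1/(1 - f): a
program that is 95 percent vectorizable is limited to a
twentyfold speedup, because the scalar 5 percent dominates.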
When a program is transferred to a Cray (or another vector
processing computer), the Fortran language compiler detects
parallel portions and identifies those which can be expressed in
vector form. These portions are executed by the vector unit; the
rest (the scalar portion) of the program is executed by the
sequential portion of the machine, which in most instances issues
one instruction at a time.
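The distinction can be illustrated with two Fortran loops (our
example, not drawn from the original). The first has independent
iterations and is detected as vectorizable; the second is a
first-order recurrence, in which each element depends on the one
just computed, so it must run in the scalar unit:

      SUBROUTINE LOOPS(N, A, B, C, X)
      INTEGER N, I
      REAL A(N), B(N), C(N), X(N)
C     Independent iterations: the compiler emits vector
C     instructions for this loop.
      DO 10 I = 1, N
         C(I) = A(I) + B(I)
   10 CONTINUE
C     Loop-carried dependence: X(I) needs X(I-1), so this loop is
C     executed by the sequential (scalar) portion of the machine.
      DO 20 I = 2, N
         X(I) = X(I-1) + B(I)
   20 CONTINUE
      RETURN
      END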
IBM, DEC, CDC, UNISYS, and other major manufacturers are
providing two to eight CPUs with which to experiment. ELXSI,
Intel, BBN, Multiflow, Ncube, Thinking Machines, Floating Point
Systems, and other new small start-up firms are building systems
with an even larger number of CPUs. However, none of these appear
to be comparable in general overall performance to current
supercomputers.
There are only a few instances in which any parallel processor
system, including current supercomputers, can be utilized in a
highly effective way.
Supercomputers primarily execute scientific and engineering
programs; the overwhelming majority of these programs are
written in the high-level language Fortran. Typical users
have thousands to hundreds of thousands of lines of existing
Fortran code which they regularly execute. Additionally, the
typical user regularly generates new programs (to solve new
scientific or engineering problems). For these new problems,
the language of choice for most of the users is Fortran,
although the languages PASCAL, MODULA-2, ADA, and "C" are
gaining some recognition.
Today, insufficient software exists for any of these systems to
demonstrate their true capability. Vector processor software is
far ahead of parallel processor software, but is still inadequate
to demonstrate the full capability of the hardware. To date, very
few programs have been written that achieve more than 50% of the
vector or parallel processing capability. Little research work
has gone into studying the optimization of programs for highly
parallel computer systems. The five current supercomputer
manufacturers, however, offer Fortran vectorizers/optimizers
which enable users to interact with their programs to produce
more effective program code.
With respect to parallel supercomputing, there has been even less
optimizing work done. Cray Research and ETA Systems are
developing UNIX-based systems that offer enhanced Fortran
parallel processing features based on the ANSI 8X Standard for
their new parallel supercomputer designs: the Cray-2, Cray-XMP,
Cray-YMP, and Cray-3, and the ETA-10 and ETA-30, respectively.
Except for the recently announced HITACHI S-810-80, the Japanese
vendors have not yet embraced parallel supercomputing in any
forthcoming commercial design.
Many of the basic technological advances which are expected in
semiconductors, biotechnology, aircraft, nuclear power, etc.
depend for their realization on the availability of
higher-performance supercomputers. Consequently, the hardware and
software aspects of high-performance computing are the focus of
much research activity. There are at least 100 experimental
parallel processing projects throughout the world, mostly at
universities. This research has identified a number of key
questions, which include:
Hardware:
o What is the optimum interconnection method between
parallel processors?
o Will synchronization costs be high?
o How should parallelism be controlled?
o Can hardware be built to support both fine-grained and
coarse-grained parallelism?
o Can caches be kept synchronized, or be made coherent, across a
large number of parallel processors?
o What level of granularity will provide the most
efficient execution?
o Can parallel architectures be made extensible?
Software:
o Can software be developed to automatically parallelize
existing programs?
o Can existing languages support parallelism?
o What is the optimum granularity?
o How can algorithms and programs be mapped onto parallel
architectures?
o What are general guidelines for developing parallel
algorithms or programs?
o How does one debug a parallel processor?
o How can an operating system support parallel
processing?
All commercially extant parallel processing systems have, to one
extent or another, addressed only subsets of these issues in
their respective designs.
Interconnection
The interconnection method affects speed and generality. It
affects the speed of a parallel processor because an
inadequate interconnection can create a bottleneck which
slows down computation. It affects generality because some
interconnection structures are well adapted to certain
computations but are poorly structured for others.
Control of Parallelism
There are two basic control philosophies: static and
dynamic. Static resource allocation requires a detailed
analysis of the computation prior to execution. Such an
analysis is only available for a few specialized programs
and we are not aware that it can be achieved today by
compilers or other software in the general case. Dynamic
resource allocation has classically been achieved in
multiprocessors by the operating system. In theory, this
approach has general applicability. In practice it is not
particularly useful in speeding up a single large
computation, because operating systems tend to introduce high
overhead.
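A minimal sketch of the two philosophies, written in modern
OpenMP notation (which post-dates this text and is used purely
for illustration; WORK is a hypothetical subroutine): a static
schedule fixes the assignment of iterations to processors before
the loop starts, while a dynamic schedule lets idle processors
claim iterations at run time, at the cost of scheduling overhead
on every dispatch:

      SUBROUTINE SCHED(N)
      INTEGER N, I
C     Static allocation: the division of work is decided before
C     execution begins; low overhead, but it requires predictable,
C     evenly sized units of work.
!$OMP PARALLEL DO SCHEDULE(STATIC)
      DO 10 I = 1, N
         CALL WORK(I)
   10 CONTINUE
C     Dynamic allocation: each idle processor fetches the next
C     iteration from a shared queue at run time; this balances
C     uneven work but adds per-iteration overhead.
!$OMP PARALLEL DO SCHEDULE(DYNAMIC)
      DO 20 I = 1, N
         CALL WORK(I)
   20 CONTINUE
      RETURN
      END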
Granularity
Some research approaches (e.g., data flow) generally tend to deal
only with parallelism in which elementary operations such as add
or multiply are considered (fine-grained parallelism). Other
approaches (e.g., multiprocessors such as the four-processor Cray
X-MP or the ETA-10) can exploit parallelism only when the problem
can be decomposed into large, essentially independent
subcomputations (coarse-grained parallelism).
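In Fortran terms (our sketch, not from the original), the
difference is one of unit size: fine-grained approaches treat the
individual operations inside a loop as the parallel work units,
while coarse-grained approaches treat whole independent
subcomputations as the units:

      SUBROUTINE GRAIN(N, NDOM, A, B, D, C)
      INTEGER N, NDOM, I, K
      REAL A(N), B(N), D(N), C(N)
C     Fine-grained: the parallel units are the individual
C     multiply-add operations of one loop nest.
      DO 10 I = 1, N
         C(I) = A(I)*B(I) + D(I)
   10 CONTINUE
C     Coarse-grained: the parallel units are whole independent
C     subproblems; SOLVE is a hypothetical solver for one
C     subdomain, which may run for minutes on its own processor.
      DO 20 K = 1, NDOM
         CALL SOLVE(K)
   20 CONTINUE
      RETURN
      END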
Cache
Many designs employ cache memories to compensate for relatively
low-performance interconnection designs or slow main memory
systems. This type of structure introduces the coherency problem,
in which the data in different caches becomes inconsistent. In
the absence of cache, high-performance interconnection structures
and fast main memory are essential.
Level of Granularity
Different research machines are effective at different levels of
granularity, and consequently the research community has
addressed the problem of which level of granularity is most
efficient.
Extensibility
There is a widespread belief that applications of the future will
require hundreds or thousands of times the computing power
currently available. Moreover, it is thought that these
applications will be highly amenable to parallel processing of
some form. It is therefore desirable to have an architecture
which can be extended to very large designs utilizing hundreds,
thousands, or millions of processors. Such an architecture could
be exploited whenever the price of components decreases, or as
soon as customers are willing to pay higher prices for larger
machines.
Automatic parallelization
Supercomputer users have an enormous investment in application
programs, which must be protected. These programs are mostly
written in Fortran. Unfortunately, there can be no compiler which
can make Fortran code highly parallel throughout the computation:
a Fortran program is sometimes parallel and sometimes not. For
this reason, the research and special-purpose machines perform
comparatively poorly on run-of-the-mill Fortran programs. Because
of their fast scalar speed, current supercomputers show very high
performance even for program segments which do not contain much
parallelism. Compilers have been in use for some time which
identify the parallelism available in Fortran programs. These
compilers have been used for vector processing supercomputers as
well as multiprocessors, such as the Alliant. Also available are
Fortran pre-processors, such as Pacific Sierra's VAST and Kuck &
Associates' KAP.
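To make the limitation concrete (our example, not taken from the
original), consider a loop that updates an array through an index
vector. Whether two iterations touch the same element of A
depends on run-time data, so no compiler can prove the iterations
independent, and a vectorizer/optimizer must either leave the
loop scalar or ask the user to assert that the indices are
distinct:

      SUBROUTINE GATHER(N, A, B, INDX)
      INTEGER N, I, INDX(N)
      REAL A(*), B(N)
C     If INDX(I) = INDX(J) for some I and J, parallel execution
C     would lose one of the two updates; only the user can know.
      DO 10 I = 1, N
         A(INDX(I)) = A(INDX(I)) + B(I)
   10 CONTINUE
      RETURN
      END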
Existing Languages to support Parallelism
Because ordinary Fortran programs are usually not suitable for
research machines, some researchers have argued that special
languages, such as ID, OCCAM, SISAL, VAL, etc., might be used to
generate more parallelism, and perhaps enough to achieve high
performance. Unfortunately, these efforts are as a rule not
applicable to the mainstream supercomputer market, because the
users are, at this time, either unable to afford or unwilling to
undertake the rewriting of existing application programs.
Optimum granularity
For most parallel architecture implementations, the
programmer must adjust the granularity to suit the
implementation.
Mapping
In many research machines, performance depends critically on
which sections of program code are assigned to which processors,
on which segments of data are assigned to which memory units, and
on similar issues. These issues are collectively known as the
mapping problem. The challenge for the programmer using vector or
parallel machines is to devise algorithms and arrange the
computations so that the architectural features of a particular
machine are fully utilized. General-purpose machines are
generally less likely to demand this kind of mapping to achieve
high performance.
Guidelines for developing parallel algorithms
As applications migrate to parallel computers, a central question
becomes how algorithms should be written to exploit parallelism.
Forcing the algorithm designer or programmer to figure out and
program explicit parallel control and synchronization is
recognized as not being the best approach. Explicit hand-coded
algorithms introduce a new level of complexity, complicate
debugging by introducing time-dependent errors, and can reduce
the portability and robustness of algorithms by forcing the
recoding of programs for each different model of parallel
computer, and in some cases for the same computer as individual
processors are added or removed for repair. Unfortunately, in
spite of massive worldwide research, no general unifying
principles for parallel processing have yet been discovered.
Debugging
Debugging parallel algorithms is a difficult task indeed, because
of the intrinsic susceptibility to time-dependent errors.
Programmers dealing with real-time events and operating systems
have been wrestling with these problems since the beginning of
the computer age, and undoubtedly this will continue to be a
problem. Even in the absence of general unifying principles of
parallel processing, the problem confronts all supercomputer
vendors, and all are devoting considerable resources to
developing workable debugging support programs.
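A minimal sketch of such a time-dependent error, again in
anachronistic OpenMP notation for concreteness: several
processors read, increment, and write a shared counter without
synchronization, so updates are lost nondeterministically and the
printed total changes from run to run. Worse, the bug may vanish
entirely when a debugger slows execution down:

      PROGRAM RACE
      INTEGER I, ISUM
      ISUM = 0
C     Unsynchronized read-modify-write of a shared variable: the
C     final value depends on how the processors' memory operations
C     happen to interleave.
!$OMP PARALLEL DO
      DO 10 I = 1, 1000000
         ISUM = ISUM + 1
   10 CONTINUE
      PRINT *, 'EXPECTED 1000000, GOT', ISUM
      END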
Operating System Support
Many but not all multiprocessors use the operating system to
support parallelism. When the operating system is used in this
way, preventing overhead from cancelling the benefits of
parallelism appears to be a major problem.
The Market
Cray Research currently dominates the supercomputer market,
with approximately 63 percent of the worldwide installed
base. The market is characterized by an increasing
recognition of computer simulation or modeling as a highly
productive alternative to traditional experimental
engineering techniques such as prototype construction. The
market has been broadened by continuing price/performance
improvements and deepened by an increasing appetite for more
and more simulation.
Supercomputers are used in applications such as
environmental forecasting, aerodynamics, structural
analysis, nuclear research, computational physics,
astronomy, chemical engineering, fluid dynamics, molecular
science, image processing, graphics, electronic circuit
design, and combustion analysis.
The supercomputer market includes both government and
commercial sectors. U.S. Government laboratories and
agencies have been the historic testing grounds for
large-scale, innovative computers. They are the principal
targets for supercomputer systems. The U.S. Government
sector has a shorter selling cycle than the commercial
sector. U.S. Government installations have grown at a
compound rate of 37.3% over the last five years.
The rapidly developing commercial sector exists because of
dramatic price/performance improvements in supercomputers
during the last decade, the learning curve generally
experienced in the utilization of these systems, and because
recently hired university graduates who have used
supercomputers demand the fastest computers when they enter
industry. Commercial sector selling cycles can range from
six months to four years. Installations in the commercial
sector have grown at a compound rate of 47.3% over the last
five years.
Mini-supercomputers
The term is a misnomer. It could mean a physically small
supercomputer, but as used in the trade media it denotes
computers that are less powerful than supercomputers. This is
contradictory, because the term "supercomputer" is defined by
authoritative sources as "the most powerful computers at any
given point in time". In technical publications, marketing
literature, and the general news media, other terms that describe
this market segment include "near", "entry", and "affordable"
supercomputers, and such variants as "crayettes" and "personal"
supercomputers.
There is also a growing tendency to attach the label
"supercomputer" to almost any machine that employs vector,
multiprocessing or parallel computing architectural concepts
in its design.
Also seen are frequent references to architectures applied to the
artificial intelligence domain, particularly those employing
parallel processing and dataflow techniques, as supercomputers.
Such "super intelligent" machines will no doubt depend on many of
the same technologies as supercomputers, from both the
architectural and the device technology points of view. But, in
our opinion, at this point in their development they are not
powerful enough to be included in the supercomputer category.
Some array processors are capable, in a highly restricted
range of applications, of achieving performance comparable
to that of supercomputers, but these cannot be categorized
as supercomputers because of their lack of general purpose
supercomputer capability across a broad spectrum of
applications.
These machines are therefore not directly competitive with
supercomputers, since they have substantially lower performance,
much smaller memories, etc. This distinction is
also inherent: the cost of manufacture of a typical
supercomputer is several times the average selling price of
the typical mini-supercomputer.
This segment of the computer market is growing very rapidly, and
there seems little doubt that some of the participants will
continue to grow, for the time being. While it might be thought
that this growth would erode supercomputer sales, statistics,
market research, and our own experience do not support this view.
It is more likely that the success of "mini-supercomputers" will
foster new applications in scientific and engineering computing,
much as DEC's VAX has. This in turn will drive a larger demand
for supercomputers.
The major strength of this group of competitors is a
price/performance advantage over minicomputers and
departmental-level scientific and engineering computers. The
group is actually composed of two distinct subgroups: those that
offer compatibility with current supercomputers (i.e., Cray), on
either a total operating-system basis or a Fortran compiler
basis, and those that are incompatible with supercomputers and
manifest new architectural and software innovations. Virtually
all are supporting the UNIX operating system environment.
The subgroup that offers supercomputer software compatibility can
exploit the large collection of scientific and engineering
applications software that has already been developed and
vectorized. By ensuring compatibility, these vendors minimize
their applications software development.
The other subgroup comprising this market is the non-compatibles.
Because of frantic competition both with entrenched vendors and
with others within this segment, the group is perhaps one of the
most innovative. One of the most promising of these, Alliant, has
a very innovative parallel and vector processor architecture and
a Fortran compiler that supports both automatic vectorization and
automatic parallelization.
Other examples of innovative design include Intel's iPSC system,
based on the Caltech Cosmic Cube architecture (Ncube and FPS are
also building variants of this design). Culler Scientific and
Multiflow are offering systems based on a VLIW (very long
instruction word) design similar to CDC's Cyberplus system. Elxsi
is offering a parallel multiprocessing system based on a very
fast interprocessor system bus architecture, and many others are
building machines that utilize hundreds, and in some cases
thousands, of microprocessors.
Both groups offer systems with claimed superior
price/performance in the under $500k scientific/engineering
minicomputer market.
Compared to large-scale general-purpose scientific systems, the
major weaknesses of most of these systems are limited system
throughput and a lack of robust system software. Most are
depending on either the UNIX market or the supercomputer
applications software market to fill the software void.
Another major weakness of this group is that most of the entrants
do not have the adequate financial resources and the mature
marketing, sales, distribution, service, and corporate management
infrastructures needed to compete massively in both domestic and
international markets against established major vendors.
The incompatible and esoteric designs will encounter a key
bottleneck: the amount of the existing software and applications
base that can be converted to use "parallelism".
Existing minicomputer suppliers will have to respond to
price/performance pressure; parallel architecture is not the
only way.
The history of esoteric non-von Neumann architectures is
full of failed attempts at commercialization.
Outlook and Conclusion
The market has now been estimated by many sources to be in excess
of $1 billion worldwide, and at its current growth rate it may
reach $2 billion in the early 1990s.
1) Architecture
Parallel architecture is generally agreed to be the
next step in higher-performance supercomputing. Cray,
ETA, Supercomputers Inc., and other competitors are all
developing parallel supercomputers.
2) Software
A successful entry into the supercomputer market
requires a software system compatible with the existing
user environment, which permits not only new
applications but also protects the users with large
investments in software.